Abstract. Querying very large RDF data sets in an efficient manner requires
a sophisticated distribution strategy. Several innovative solutions have
recently been proposed for optimizing data distribution with predefined
query workloads. This paper presents an in-depth analysis and experimental
comparison of five representative and complementary distribution approaches.
To achieve fair experimental results, we use Apache Spark as a common
parallel computing framework and rewrite the concerned algorithms using the
Spark API. Spark provides guarantees in terms of fault tolerance, high
availability and scalability, which are essential in such systems. Our
implementations aim to highlight the fundamental implementation-independent
characteristics of each approach in terms of data preparation, load
balancing, data replication and, to some extent, query answering cost and
performance. The presented measures are obtained by testing each system on
one synthetic and one real-world data set over query workloads with
differing characteristics and different partitioning constraints.


1 Introduction 

During the last few years, a significant number of papers have been published
on the distribution issue in RDF database systems. The main motivation of this
research movement is the efficient management of the ever-growing size of
produced RDF data sets, i.e., repositories of hundreds of millions to billions
of RDF triples are now more and more frequent. Being one of the popular data
models of the Big Data ecosystem, RDF has to cope with issues such as
scalability, high availability and fault tolerance. Systems addressing these
issues, e.g., NoSQL systems [20], generally adopt a scale-out approach
consisting of distributing both data storage and processing over a cluster of
commodity hardware.

Depending on the data model, it is well known that an optimal distribution,
e.g., in terms of data replication rate, load balancing and query answering
performance, may be hard to achieve. Each distribution approach also comes with
a set of data transformation and processing steps that are more or less
intensive.

Concerning graphs in general, obtaining a balanced partitioning is known
to be an NP-hard problem. Hence, most systems propose heuristic-based
approaches which tend to produce distributions with interesting properties. In
a query processing context, one of the most important properties is the ability
to limit the amount of data exchanged over the network constituting the
cluster. In fact, with distributed join processing, a machine may have to
transfer a large locally computed temporary result to another machine for
further processing. In such situations, the total duration of the query
answering process can largely be dominated by the exchange of large data chunks
over the cluster network, e.g., hundreds or thousands of gigabytes are not
uncommon.

The first systems considering distributed storage and query answering for
RDF data appeared quite early in the history of RDF. Systems like Edutella
and RDFPeers [3] were already tackling partitioning issues in the early
2000s. More recently, systems like YARS2 [12] and Virtuoso were based on
hashing one of the RDF triple components, most frequently the subject. In 2011,
[14] (henceforth denoted nHopDB) was the first attempt to use a graph
partitioning approach to fragment an RDF dataset. This system has reinvigorated
the research community on this topic. Recent systems either extend the
graph partitioning approach, e.g., WARP [13], or point out its
limitations, e.g., SHAPE [16].

As a consequence of the plethora of distribution strategies, it is not always
easy to identify the most efficient solution in a given context. The first
objective of this paper is to clarify this situation by conducting evaluations
of leading RDF triple distribution algorithms. A second goal is to consider
Apache Spark as the parallel computing framework for hosting these
implementations. This is particularly relevant in a context where a large
portion of existing distributed RDF databases, e.g., nHopDB, Semstore [25],
SHAPE [16], SHARD [15], have been implemented using Apache Hadoop, i.e., the
open source MapReduce [3] reference implementation. In [21], limitations of
considering MapReduce as a database system have been identified, some of them
being related to the high rate of disk reads and writes. Spark can be up to
100 times more efficient than Hadoop precisely because it tends to work with
data stored in main memory.

Our experimentation is conducted over a reimplementation of five approaches:
two hash-based, two based on graph partitioning and a hybrid one. Each system
is evaluated on two datasets, one synthetic and one real-world, over varying
cluster settings and on a total of six queries which differ in terms of their
shape, e.g., star and property chains, and selectivity. We present and analyze
experiments conducted in terms of the time required to prepare the data, load
balancing, data replication rate and query answering performance.


2 Background knowledge 
2.1 RDF - SPARQL 

RDF is a schema-free data model for describing data on the Web. It
is usually considered the cornerstone of the Semantic Web and the Web of
Data. Assuming disjoint infinite sets $U$ (RDF URI references), $B$ (blank
nodes) and $L$ (literals), a triple
$(s,p,o) \in (U \cup B) \times U \times (U \cup B \cup L)$ is called an RDF
triple with $s$, $p$ and $o$ respectively being the subject, predicate and
object. We now also assume that $V$ is an infinite set of variables and that it
is disjoint with $U$, $B$ and $L$. We can recursively define a SPARQL¹ triple
pattern as follows: (i) a triple
$tp \in (U \cup V) \times (U \cup V) \times (U \cup V \cup L)$ is a SPARQL
triple pattern, (ii) if $tp_1$ and $tp_2$ are triple patterns, then
$(tp_1 \,.\, tp_2)$ represents a group of triple patterns that must all match,
$(tp_1\ \mathrm{OPTIONAL}\ tp_2)$ where $tp_2$ is a set of patterns that may
extend the solution induced by $tp_1$, and $(tp_1\ \mathrm{UNION}\ tp_2)$,
denoting pattern alternatives, are triple patterns and (iii) if $tp$ is a
triple pattern and $C$ is a built-in condition then the expression
$(tp\ \mathrm{FILTER}\ C)$ is a triple pattern that restricts the solutions of
a triple pattern match according to the expression $C$. The SPARQL syntax
follows the select-from-where approach of SQL queries. The SELECT clause
specifies the variables appearing in the query result set.
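As an illustration of case (i), the following minimal Python sketch matches a single triple pattern against an RDF triple and returns the induced variable binding. It is only a sketch under stated assumptions: terms are plain strings, variables are recognized by a leading `?`, and the names `is_var` and `match_pattern` are hypothetical helpers, not part of any SPARQL engine.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    """Assumption: a term is a variable when it starts with '?'."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple) -> Optional[Dict[str, str]]:
    """Return the variable binding if `triple` matches `pattern`, else None."""
    binding: Dict[str, str] = {}
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable used twice in the pattern must bind to the same value.
            if binding.get(p_term, t_term) != t_term:
                return None
            binding[p_term] = t_term
        elif p_term != t_term:
            # A constant in the pattern must equal the triple component.
            return None
    return binding
```

For example, matching `("?x", "advisor", "?y")` against `("Bob", "advisor", "Alice")` binds `?x` to `Bob` and `?y` to `Alice`, while a triple with a different predicate yields no binding.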


2.2 Apache Spark 

Apache Spark is a cluster computing framework whose design and implementation
started at UC Berkeley’s AMPLab. Just like Apache Hadoop, Spark
enables parallel computations on unreliable machines and automatically handles
locality-aware scheduling, fault tolerance and load balancing tasks. While both
systems are based on a data flow computation model, Spark is more efficient
than Hadoop for applications that reuse working datasets across multiple
parallel operations. This efficiency is due to Spark’s Resilient Distributed
Dataset (RDD) [24], a distributed, lineage-supported, fault-tolerant memory
abstraction that enables in-memory computations (whereas Hadoop is mainly
disk-based). The Spark API also simplifies programming tasks by integrating
functions which are not natively supported in Hadoop, e.g., join and filter.


2.3 Metis graph partitioner 

Due to the complexity of partitioning a graph in an optimal manner, several
methods have been defined to propose an approximation, e.g., [7]. These
algorithms are generally not efficient for large graphs, for which a
multi-level propagation approach is frequently used, i.e., the graph is
coarsened until its size permits the use of one of the approximate solutions,
then it is uncoarsened. The Metis system [15] follows this approach and is
known to reach its limits for graphs of about half a billion triples. Metis
takes as input an unlabeled, undirected graph and an integer value
corresponding to the desired number of partitions. Its output provides a
partition number for each node of the graph. nHopDB and WARP are two recent
systems that use Metis to partition RDF graphs.


1 http://www.w3.org/TR/rdf-sparql-query/ 



3 Systems and distributed algorithms 


In this section, we present the main features and design principles of the
distribution methods we have selected. We consider four different approaches
which can be characterized as hash-based and graph partitioning-based. Each
category is composed of two approaches which have been used in systems and
described in conference publications. Our fifth system corresponds to a hybrid
approach that mixes a hash-based approach with a replication strategy that
makes it possible to efficiently process long chain queries. Note that we do
not consider systems that partition using a range-based approach since they are
rarely encountered in existing systems due to their inefficiency.


3.1 Hash based approaches 

The two approaches defined in this section correspond to families of RDF database 
systems rather than to specific systems (as in the next section). If not extended in 
a particular manner, these systems do not replicate any triples across partitions. 


Random hashing: In a distributed random hash-based solution, the key on
which the data partitioning is specified does not correspond to any particular
data stored in the data model. For instance, the key can correspond to an
internal triple identifier or to some operation over the entire triple. The
former solution is the one adopted by the Trinity.RDF system. These two
approaches do not require an additional data structure to identify the
partition a particular entry is stored in. The only elements that are required
for a directed lookup are the hash function and the method to obtain the key.
Some other forms of random partitioning exist, e.g., the round-robin approach,
and may require an additional structure for directed lookups to the cluster
nodes where triples are located. We do not consider such approaches in this
work since they do not guarantee good query processing properties for any of
the query shapes (star, property chains, tree, cycle or hybrid).


RDF triple element hashing: In this approach, the key provided to the hash
function is one of the elements of RDF triples. The most frequent approach
is to partition by triple subject but the object or predicate element can also
be considered. Partitioning by subject provides the nice property of ensuring
that star-shaped queries, i.e., queries composed of a graph where only one node
has an out-degree greater than 1, are performed locally on a given machine.
Nevertheless, it does not provide guarantees for queries composed of property
chains or complex query patterns. One advantage of this approach is that it
does not require an additional structure to locate the partition of a given
key. Systems like YARS2, Virtuoso, Jena ClusteredTDB and SHARD adopt this
approach.
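The difference between the two hash-based families can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation evaluated in this paper: the function names are hypothetical, and `zlib.crc32` is an arbitrary choice of deterministic hash function made here for reproducibility.

```python
import zlib
from typing import Tuple

Triple = Tuple[str, str, str]


def stable_hash(key: str) -> int:
    """Deterministic hash (assumption: CRC32 is good enough for a sketch)."""
    return zlib.crc32(key.encode("utf-8"))


def random_hash_partition(triple: Triple, n_partitions: int) -> int:
    """Random hashing: the key is the entire triple."""
    return stable_hash("\t".join(triple)) % n_partitions


def subject_hash_partition(triple: Triple, n_partitions: int) -> int:
    """Triple element hashing: the key is the subject only."""
    return stable_hash(triple[0]) % n_partitions
```

With subject hashing, all triples sharing a subject land in the same partition, which is what lets star-shaped queries run locally. With whole-triple hashing no such co-location is guaranteed, but no lookup structure is needed in either case.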


3.2 Graph partitioning based approach 

The hash-based approaches just presented are likely to require a high data
exchange rate over the network for complex query patterns, i.e., those not
corresponding to a star. One way to address this issue is to organize a
replication of data and/or to analyze the query workload. Of course, such an
organization comes at a processing cost which needs to be considered carefully.
Systems corresponding to each of these approaches are considered next.


nHopDB: The distribution approach presented in [14] is composed of two
stages. In the first stage, the RDF dataset is transformed such that it can be
sent to the Metis graph partitioner, i.e., properties are removed and the graph
is made undirected, with subjects and objects encoded contiguously. Then,
Metis’s results are translated into an allocation of triples over the cluster.
The partition state obtained at the end of stage 1 is denoted as 1-hop. The
second stage corresponds to an overlap strategy which is performed using a
so-called n-hop guarantee. Intuitively, for each partition, each leaf l is
extended with triples whose subject corresponds to l. This second stage can be
performed several times on the successively generated partitions. Each
execution increases the n-hop guarantee by a single unit.
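One execution of the second stage can be pictured with the following Python sketch. It is a simplification under stated assumptions: every object occurring in a partition is treated as a candidate leaf, partitions are in-memory sets of triples, and `extend_one_hop` is a hypothetical name rather than the nHopDB implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def extend_one_hop(partitions: Dict[int, Set[Triple]],
                   dataset: Iterable[Triple]) -> Dict[int, Set[Triple]]:
    """Add to each partition the triples whose subject is an object it holds."""
    by_subject: Dict[str, Set[Triple]] = defaultdict(set)
    for triple in dataset:
        by_subject[triple[0]].add(triple)

    extended: Dict[int, Set[Triple]] = {}
    for pid, triples in partitions.items():
        grown = set(triples)
        for _, _, obj in triples:
            # Replicate the outgoing triples of each (candidate) leaf node.
            grown |= by_subject.get(obj, set())
        extended[pid] = grown
    return extended
```

Calling the function again on its own output raises the guarantee by one more hop, which mirrors the repeated execution described above.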

[14] describes an architecture composed of a data partitioner and a set of
workers corresponding to RDF-3X database instances. Some queries can be
executed locally on a single node and thus enjoy all the optimization machinery
of RDF-3X. For queries where the answer set spans multiple partitions, the
Hadoop MapReduce system is used to supervise query processing.


WARP: The WARP system [13] has been influenced by nHopDB and the
Partout system [8] (the two authors of WARP also worked on Partout). From the
former, it borrows the graph partitioning approach and the 2-hop guarantee.
Just like Partout, it then refines triple allocation by considering the query
workload, i.e., a set of the most frequently performed queries over this
dataset. The system considers that this query workload is provided in one way
or another. In fact, each of these queries is transformed into a set of query
patterns. As a result, WARP guarantees that some frequent queries can be
processed locally without exchanging data across machines. For these queries,
each partition should contain sufficient data such that the result of the query
is the union of local results. WARP proceeds as follows:

1. It computes a first data partitioning using the Metis graph partitioner. 

2. It fragments the data in partitions according to the subject value and loads
each data partition into an independent RDF-3X database instance.

3. A replication strategy is applied to ensure a 2-hop guarantee. 

4. For each query pattern, WARP computes the number of triples to replicate.
To this end, it decomposes the pattern into a set of sub-queries which are
all evaluated locally. Each of those sub-queries is a candidate to be
the starting point (called seed query) for the evaluation of the entire query
pattern. The main idea of WARP is to bring the missing triples into the
partitions that contain the triples of the seed. To do so, for each candidate
and partition, it computes the cost of transferring missing triples into the
current partition. It then selects the seed query candidate that minimizes the
cost.

The WARP system implements its own distributed join operator to combine 
the local sub-queries. Locally, the queries are executed using RDF-3X machinery. 

3.3 Hybrid approach 

The design of this original hybrid approach has been motivated by our analysis
of the WARP system as well as some hash-based solutions. We have already
highlighted (to be confirmed in the next section) that the hash-based solutions
require short data preparation times but come with poor query answering
performance for complex query patterns. On the other hand, the WARP system
proposes an interesting analysis of query workloads which is translated into an
efficient data distribution. Next, we will see that most of the data
preparation time for WARP is spent in the graph partitioning stage. Hence, it
seems interesting to combine a hash-based partitioning with a query
workload-aware refinement.

4 Spark system implementations 

4.1 Dataset loading and encoding 

All datasets are first loaded on the cluster’s Hadoop Distributed File System
(HDFS). In the experimentation section, we do not provide measures on this
loading stage. We can only stress that the loading rate in our cluster averages
520,000 triples per second.

As in most RDF stores, each dataset is encoded by assigning a distinct integer
value to each node and edge of the graph (see [4], Chapter 4, for a
presentation of RDF triple encoding methods). The computation is completely
performed in parallel in one step using the Spark framework. We do not provide
implementation details due to space limitations². The encoded datasets,
together with their dictionaries (one for the properties and another for
subjects and objects), are also loaded into HDFS. In all experiments, the data
is loaded within the Spark programs from HDFS.


4.2 Hash-based approaches 

This approach is relatively straightforward in the context of Spark, which
provides through its API the methods to partition a dataset. In the case of the
random-hash partitioning, the system computes the partition of a triple given
its subject, predicate and object, i.e., the key is the triple. In the triple
element hashing, we specify the subject as the key to the partitioning method.
None of the implementations is extended to provide any form of replication. The
query answering evaluation is performed directly, following a translation from
SPARQL to Spark scripts requiring a mix of map, filter, join and distinct
methods performed over RDDs.

2 Consult http://www-bd.lip6.fr/wiki/doku.php?id=site:recherche:logiciels:rdfdist for
implementation details.
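The shape of such a translation can be sketched with plain Python collections standing in for RDDs. The sketch below evaluates a three-pattern property chain (`?x advisor ?y . ?y worksFor ?z . ?z subOrganization ?t`) with one filter per property, two equality joins and a final distinct. It is an assumption-laden illustration: `evaluate_chain` is a hypothetical name, and the in-memory dictionaries only mimic what a distributed join operator would do.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def index_by_subject(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build a subject -> objects index, used here as a simple hash join side."""
    index: Dict[str, List[str]] = {}
    for subject, obj in pairs:
        index.setdefault(subject, []).append(obj)
    return index


def evaluate_chain(triples: List[Triple]) -> List[Tuple[str, str, str, str]]:
    # filter: keep one (subject, object) list per property of the pattern
    advisor = [(s, o) for s, p, o in triples if p == "advisor"]
    works_for = [(s, o) for s, p, o in triples if p == "worksFor"]
    sub_org = [(s, o) for s, p, o in triples if p == "subOrganization"]

    # join: equality on ?y, then on ?z
    works_for_idx = index_by_subject(works_for)
    sub_org_idx = index_by_subject(sub_org)
    rows = [(x, y, z, t)
            for x, y in advisor
            for z in works_for_idx.get(y, [])
            for t in sub_org_idx.get(z, [])]

    # distinct: remove duplicate bindings (sorted only for a stable output)
    return sorted(set(rows))
```

On a hash-partitioned dataset without replication, each of the two joins may involve triples stored on different machines, which is where the network exchange discussed earlier comes from.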


4.3 Graph partitioning-based approaches 

The two systems in this partitioning category require three Metis-related
steps: preparation, computation and transformation of the results. Because
Metis deals with unlabeled and undirected graphs, we start by removing
predicates from the datasets, then append the reversed subject/object pairs to
the pair set, thus yielding an undirected graph. Using Metis also imposes
limitations in terms of accepted graph size. Indeed, the largest graph that can
be processed contains about half a billion nodes. Consequently, we limit our
experiments to datasets of at most 250 million RDF triples, given that their
undirected transformation yields graphs of 500 million nodes. The output of
Metis is a set of mapping assertions between a node and its partition. Based on
these mappings, we allocate a triple to the partition of its subject. In terms
of data encoding, we extend triples with partition identifiers, yielding quads.
Note that at this stage, the partition identifier can be considered as
’logical’ and not ’physical’ since the data is not yet stored on a given
machine. We would like to stress that the preparation and transformation phases
described above are performed in parallel using Spark programs.
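The preparation and transformation phases can be sketched as follows in Python, on integer-encoded triples. This is a minimal illustration under assumptions: the node-to-partition mapping is passed in as a plain dictionary (in practice it would be read from the Metis output), and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[int, int, int]          # (subject, predicate, object), encoded
Quad = Tuple[int, int, int, int]       # triple + logical partition identifier


def to_undirected_edges(triples: List[Triple]) -> Set[Tuple[int, int]]:
    """Preparation: drop predicates and append the reversed pairs."""
    pairs = {(s, o) for s, _, o in triples}
    return pairs | {(o, s) for s, o in pairs}


def to_quads(triples: List[Triple], node_partition: Dict[int, int]) -> List[Quad]:
    """Transformation: allocate each triple to the partition of its subject."""
    return [(s, p, o, node_partition[s]) for s, p, o in triples]
```

The fourth component of each quad is the 'logical' partition identifier mentioned above: nothing in this step moves data to a machine yet.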

Concerning the nHopDB system, the n-hop guarantee is computed over the 
RDD corresponding to generated quads. This Spark program can be executed 
(n-1) times to obtain an n-hop guarantee. 

Intuitively, our WARP implementation analyzes the query workload
generalization using Spark built-in operators. For instance, consider the
following Basic Graph Pattern (henceforth BGP) of a query denoted Q1:
?x advisor ?y . ?y worksFor ?z . ?z subOrganization ?t. The system uses the
filter operator to select the triples that match the advisor, worksFor and
subOrganization properties. Moreover, the join operator is used to evaluate the
equality join predicates on variables ?y and ?z. A query result is thus a set
of bindings. We extend the notion of variable bindings with the information
regarding the partition identifier of each triple. For instance, an extract of
Q1’s result (in an unencoded readable form) is represented as
{(Bob, Alice, 1), (Alice, DBteam, 3), (DBteam, Univ1, 1)}.

We see on the extracted result that the triple binding for ?y worksFor ?z is
{(Alice, DBteam, 3)} ("Alice" and "DBteam" are respectively bound to variables
?y and ?z) and is located on partition 3 whereas the two other triples are
on partition 1. Thus we can efficiently access it to count the number of
triples to replicate. For instance, if we consider the seed (?x advisor ?y), we
need to replicate the triple (Alice, worksFor, DBteam) in partition 1 by
copying it from partition 3. As specified earlier in Section 3.2, we consider
all the candidate seeds and choose the one that implies the minimal number of
triples to replicate.
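The seed selection on such partition-annotated bindings can be sketched in Python as follows. This is a simplified reading of the procedure, stated as an assumption: the cost of a seed is taken to be the number of distinct (triple, target partition) copies it requires, and `replication_for_seed` and `best_seed` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
# One result row: for each triple pattern of the BGP, the matched triple and
# the partition it currently resides in.
Row = List[Tuple[Triple, int]]


def replication_for_seed(results: List[Row], seed: int) -> Set[Tuple[Triple, int]]:
    """Triples that must be copied into the partition of the seed's triple."""
    to_copy: Set[Tuple[Triple, int]] = set()
    for row in results:
        target = row[seed][1]
        for triple, pid in row:
            if pid != target:
                to_copy.add((triple, target))
    return to_copy


def best_seed(results: List[Row], n_patterns: int) -> Tuple[int, Dict[int, int]]:
    """Return the index of the cheapest seed and the cost of every candidate."""
    costs = {i: len(replication_for_seed(results, i)) for i in range(n_patterns)}
    return min(costs, key=costs.get), costs
```

On the single row of the running example, choosing the first pattern as seed requires one copy, (Alice, worksFor, DBteam) into partition 1, whereas choosing the second pattern would require copying the two other triples into partition 3.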

The next step extends the partitions with replicates. This is relatively
straightforward using Spark’s union operator.

Finally, for querying purposes, each query is extended with a predicate
enforcing local evaluation by joining triples with the same partition
identifier.


4.4 Hybrid approach 

This approach mixes the subject-based hashing method with the WARP
workload-aware processing. Using our standard representations of triples
and quads together with Spark’s ability to easily handle data transformations
kept our coding effort for this experiment relatively low.

5 Experimental setting 

5.1 Datasets and queries 

In this evaluation, we use one synthetic and one real-world dataset. The
synthetic data set corresponds to the well-established LUBM [9]. We use
three instances of LUBM, denoted LUBM1K, LUBM2K and LUBM10K, which
are parameterized respectively with 1000, 2000 and 10000 universities. The
real-world data set consists of Wikidata [22], a free collaborative knowledge
base which will replace Freebase [2] in 2015. Table 1 presents the number of
triples as well as the size of each of these data sets.


Data set     #triples    nt File Size
LUBM1K       133 M       22 GB
LUBM2K       267 M       43 GB
LUBM10K      1,334 M     213 GB
Wikidata     233 M       37 GB

Table 1. Dataset statistics of our running examples


Concerning queries, we have selected three SPARQL queries from LUBM
(namely queries #2, #9 and #12, respectively denoted Q2, Q3 and Q4) and we
have created an additional one (denoted Q1) which requires a 3-hop guarantee to
be performed locally on the nHopDB, WARP and hybrid implementations. To
complement the query evaluation, we have created two queries for the Wikidata
experiments, namely Q5 and Q6. The first one takes the form of a 3-hop property
chain query that proves to be much more selective than the LUBM ones; the
second one is shaped as a simple star and was motivated by the absence of such
a form in our query set. All six queries are presented in Appendix A.









5.2 Computational environment 


Our evaluation was deployed on a cluster consisting of 21 DELL PowerEdge
R410 machines running a Debian distribution with a 3.16.0-4-amd64 kernel
version. Each machine has 64GB of DDR3 RAM and two Intel Xeon E5645 processors,
each of which is equipped with 6 cores running at 2.40GHz and able to run two
threads in parallel (hyperthreading). Hence, the number of virtual cores
amounts to 24 but we used only 15 cores per machine. In terms of storage, each
machine is equipped with a 900GB 7200rpm SATA disk. The machines are connected
via a 1GB/s Ethernet network adapter. We used Spark version 1.2.1 and
implemented all experiments in Scala, using version 2.11.6. The Spark setting
requires the total number of cores of the cluster to be specified. Since in our
experiments we considered clusters of 5, 10 and 20 machines, we had to set the
number of cores to 75, 150 and 300 respectively.


6 Experimentation 

Since we could not get any query workloads for Wikidata, it was not possible
to conduct experiments with WARP and the hybrid approach over this
dataset. Moreover, since Metis is limited to graphs of about half a billion
nodes (see Section 4.3), it was not possible to handle nHopDB and WARP over
LUBM10K. Given that the hybrid system relies on subject hashing, and not on
Metis, it was possible to conduct this experiment over LUBM10K for that system.


6.1 Data preparation 

Figure 1 presents the data preparation processing times for the different
systems. As one would expect, the hash-based approaches are much more efficient
than the graph partitioning-based approaches, between 6 and 30 times faster
depending on the number of partitions. This is mainly due to the fact that
Metis runs on a single machine (we have not tested ParMetis, a parallelized
version of Metis) while the hash operations are performed in parallel on the
Spark cluster. The evaluation also emphasizes that the hybrid approach presents
an interesting compromise between these distribution method families. By
evaluating the different processing steps in each of the solutions, we find
that, for hash-based approaches, around 15% of the processing time is spent on
loading the datasets whereas the remaining 85% is spent on partitioning the
data. For the graph partitioning approaches, 85 to 90% corresponds to the time
spent by Metis creating the partitions; the durations increase with the dataset
size. This explains why the time spent by graph partitioning approaches
slightly increases even when more machines are added. This does not apply to
the other solutions, where more machines lead to a reduction of the preparation
processing time.

Fig. 1. Data preparation times


6.2 Balanced storage 

Load balancing is an important aspect when distributing data for storage and
querying purposes. In Figure 2, we present the standard deviations of the
partition sizes (in log scale) for our different systems. For the graph
partitioning-based and hybrid approaches, we only consider the standard
deviation of the partition sizes at the end of the partitioning process, i.e.,
Metis partitioning and n-hop guarantee application.

The two hash-based approaches and the hybrid approach are the best solutions
and are close to each other. This is expected since the hash partitioning
approaches concentrate on load balancing while a graph partitioner
tries to reduce the number of edges cut during the fragmentation process. The
hybrid approach is slightly less well balanced due to the application of the
WARP query workload-aware strategy. Random hashing has 5 to 12% less
deviation than subject hashing. This is due to high-degree nodes that may
increase the size of some partitions. Although ranging in similar standard
deviation values, the nHopDB approach is the less efficient of the two graph
partitioning solutions. We believe that this is highly related to the number of
queries one considers in the query workload. We consider that further analysis
needs to be conducted on real-world datasets and query workloads to confirm
these nevertheless interesting conclusions.

Fig. 2. Standard deviation
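As a sketch of how such a balance measure can be obtained from partition-annotated triples, the following Python function computes the standard deviation of the partition sizes. Two points are assumptions rather than facts from our implementation: the use of the population standard deviation (`statistics.pstdev`), and the hypothetical name `partition_size_stddev`.

```python
import statistics
from collections import Counter
from typing import List, Tuple

Quad = Tuple[int, int, int, int]  # (subject, predicate, object, partition id)


def partition_size_stddev(quads: List[Quad], n_partitions: int) -> float:
    """Standard deviation of the number of triples stored per partition."""
    sizes = Counter(quad[-1] for quad in quads)
    # Empty partitions count as size 0 so that they are not silently ignored.
    counts = [sizes.get(pid, 0) for pid in range(n_partitions)]
    return statistics.pstdev(counts)
```

A perfectly balanced distribution gives 0, and a single high-degree subject hashed to one partition is enough to push the value up, which is the effect discussed above for subject hashing.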


6.3 Data replication 


Intrinsically, all solutions present some node replications since a given node can 
be an object in one partition and a subject in another one. This corresponds to 
the 1-hop guarantee that ensures validity of data. In this section, we are only 
interested in triple replication. Only the nHopDB, WARP and hybrid solutions 
present such replications. 
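To make the notion concrete, a triple replication rate can be computed as sketched below in Python. The definition used here is an assumption made for illustration, namely the number of extra stored copies divided by the number of distinct triples; the function name `replication_rate` is hypothetical as well.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def replication_rate(partitions: Dict[int, Set[Triple]]) -> float:
    """(stored triples - distinct triples) / distinct triples."""
    stored = sum(len(triples) for triples in partitions.values())
    distinct = len(set().union(*partitions.values()))
    if distinct == 0:
        return 0.0  # nothing stored, nothing replicated
    return (stored - distinct) / distinct
```

Under this definition, a rate of 0 means that no triple is stored twice, and a rate above 1 means that, on average, each triple has more than one extra copy in the cluster.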

Table 2 provides the replication rates for each of these systems for the
LUBM1K and LUBM2K datasets. Several conclusions can be drawn from this table.
First, Metis-based approaches are more efficient than the subject hashing of
the hybrid system. Remember that by minimizing edge cut, a graph partitioner
groups the nodes that are close to each other in the input graph. Secondly, the
more partitions the cluster contains, the more overall replication one obtains.
The n-hop guarantee replicates less than the query workload-aware method of
WARP. Finally, we can stress that the replication of the hybrid approach can be
considered quite acceptable given the data preparation duration highlighted in
Section 6.1.

Part. scheme   nHopDB                       WARP                         Hybrid
Data set       5 part.  10 part.  20 part.  5 part.  10 part.  20 part.  5 part.  10 part.  20 part.
LUBM1K         0.12     0.16      0.17      0.26     0.54      0.57      0.54     1.33      1.84
LUBM2K         0.12     0.16      0.18      0.34     0.52      0.54      0.54     1.33      1.94

Table 2. Replication rate comparison for three partitioning schemes and three
cluster sizes


6.4 Query processing

In order to efficiently process local queries and to fairly support performance
comparison in a distributed setting, we must use the same computing resources
for local and distributed runs. A local query runs in parallel when every
machine
only has to access its own partition. To exploit the multicore machines on which
we perform our experiments, it is interesting to consider not only
inter-partition parallelism but intra-partition parallelism as well.
Unfortunately, intra-partition parallelism is not fully supported in Spark
since a partition is the unit of data that one core processes. Thus, to use 15
cores on a machine, we must split a partition into 15 sub-partitions. Spark
does not allow one to specify that such sub-partitions must reside together on
the same machine. We expect that future versions of Spark will allow such
control. In the absence of any triple replication, the hash-based solutions are
not impacted by this limitation. This is not the case for the systems using
replication. For instance, for the two query workload-aware solutions (i.e.,
WARP and hybrid), we conducted our experiment using a workaround that forces
Spark to use only one machine for one partition: for local queries, we run
Spark with only one slave machine. Then we load only the data of one partition
and process the query locally in parallel using all the cores. To be fair and
take into account the possibility that a local query might run faster in some
partitions than in others, we repeat the experiment for every partition and
report the maximum response time.
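The measurement protocol of this workaround can be summarized by the following Python sketch. It only illustrates the "repeat per partition, report the maximum" logic: `run_query` is a hypothetical callable standing in for the local Spark job, and wall-clock timing with `time.perf_counter` is an assumption made for the sketch.

```python
import time
from typing import Callable, Dict, Iterable


def max_local_response_time(partitions: Dict[int, Iterable],
                            run_query: Callable[[Iterable], object]) -> float:
    """Run the query on each partition in turn and keep the slowest run."""
    worst = 0.0
    for _pid, triples in partitions.items():
        start = time.perf_counter()
        run_query(triples)  # local evaluation over a single partition
        worst = max(worst, time.perf_counter() - start)
    return worst
```

Reporting the maximum rather than the mean reflects that, in a real parallel run, the query finishes only when the slowest partition has answered.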

The case of nHopDB is more involved and requires the development of a dedicated
query processor, specialized for Spark, to fully benefit from the data
fragmentation. In a nutshell, that system would have to combine an
intra-partition and an inter-partition query processor. The former would run
for query subgraphs that can run locally and the latter would perform joins
over all partitions with the retrieved temporary results. Since the topic of
this paper concerns the evaluation of distribution strategies, we do not detail
the implementation of such a query processor in this work and hence we do not
present any results for the nHopDB system.

Table 3 presents the query processing times for our datasets. Due to space
limitations, we only present the execution times obtained over the 20
partitions experiment. The companion web site (see footnote 2) highlights that
the more partitions, the more efficient the query processing. The table clearly
highlights that the WARP systems are more efficient than the hashing-based
solutions. Obviously, the simpler queries, e.g., Q4 and Q6, run locally while
the others require inter-partition communication. With the Spark version (i.e.,
1.2.1) we were conducting this experiment on, we could not measure the
inter-node information communication. In fact, Spark’s shuffle read measure
only indicates the total information exchange (locally on a node and globally
over the network).
















Fig. 3. Query Evaluation on 20 partitions 


7 Related work 


Some other interesting works have recently been published on the distribution
of RDF data. Systems such as SemStore [23] and SHAPE m take some original
positions. Instead of using the common query workload, SemStore divides a
complete RDF graph into a set of paths which cover all the original graph nodes,
possibly with node overlapping between paths. These paths are denoted Rooted
Sub Graphs (RSG in short) since they are generated starting from nodes with
a null in-degree, i.e., roots, to all their possible leaves. A special workaround is
used to handle cycles that may occur at the root position, i.e., cycles that are
not reachable from any root. The idea is then to regroup these RSGs into
different partitions. This is obviously a hard problem for which the authors propose
an approximate solution. Their solution uses the K-means clustering approach,
which regroups RSGs with common segments together in the same partition. A
first limitation of this approach is the high dimensionality of the vectors handled
by the K-means algorithm, i.e., the size of any vector corresponds to the number
of nodes in the graph. A second limitation is related to the lack of an efficient
balancing of the number of triples across the partitions. In fact, the system operates
at the coarse-grained level of RSGs and provides balancing at this level only.
Finally, SemStore is limited in terms of join patterns. It can efficiently handle S-O
(subject-object) and S-S (subject-subject) join patterns but other patterns, such
as O-O (object-object), may require inter-node communication.
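
To make the dimensionality issue concrete, the following Python sketch builds
rooted subgraphs on a toy graph and encodes each one as a binary vector with one
dimension per graph node. It is only an illustration under our own assumptions:
the triples, the function names (rooted_subgraphs, node_vector) and the binary
encoding are hypothetical and are not taken from the SemStore implementation,
and the special handling of cycles that are unreachable from any root is omitted.

```python
from collections import defaultdict

# Toy RDF graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("alice", "advisor", "bob"),
    ("bob", "worksFor", "dept1"),
    ("dept1", "subOrganizationOf", "univ0"),
    ("carol", "advisor", "bob"),
    ("carol", "takesCourse", "course7"),
]


def rooted_subgraphs(triples):
    """Map each root (node with no incoming edge) to the triples reachable from it."""
    out_edges = defaultdict(list)
    has_incoming = set()
    nodes = set()
    for s, p, o in triples:
        out_edges[s].append((s, p, o))
        has_incoming.add(o)
        nodes.update((s, o))
    rsgs = {}
    for root in sorted(nodes - has_incoming):
        seen, stack, reached = {root}, [root], []
        while stack:
            node = stack.pop()
            for triple in out_edges[node]:
                reached.append(triple)
                if triple[2] not in seen:
                    seen.add(triple[2])
                    stack.append(triple[2])
        rsgs[root] = reached
    return rsgs


def node_vector(rsg_triples, node_index):
    """Binary vector with one dimension per node of the whole graph."""
    vec = [0] * len(node_index)
    for s, _, o in rsg_triples:
        vec[node_index[s]] = 1
        vec[node_index[o]] = 1
    return vec


rsgs = rooted_subgraphs(TRIPLES)
all_nodes = sorted({n for s, _, o in TRIPLES for n in (s, o)})
node_index = {n: i for i, n in enumerate(all_nodes)}
vectors = {root: node_vector(t, node_index) for root, t in rsgs.items()}
```

Here the two roots (alice and carol) share the segment going through bob, which
is the kind of overlap a clustering step would exploit. Each vector has as many
entries as the graph has nodes, so the vector size grows with the data set
itself.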

The motivation of the SHAPE system is that graph partitioning approaches
do not scale. Just like in our hybrid solution, its authors propose to replace the
graph partitioning step with a hash partitioning one. Then, just like in the nHopDB
system, they replicate according to the n-hop guarantee. Hence, they do not
consider any query workload and take the risk of inter-partition communication
for chain queries longer than their n-hop guarantee.
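
The combination of hash partitioning and n-hop replication can be sketched as
follows in Python. This is a simplified illustration, not the SHAPE or nHopDB
code: we assume hashing on the subject, a directed (subject to object)
expansion, and hypothetical names (subject_partition, hash_then_n_hop, CHAIN).

```python
import zlib
from collections import defaultdict


def subject_partition(subject, num_partitions):
    # crc32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32(subject.encode("utf-8")) % num_partitions


def hash_then_n_hop(triples, num_partitions, n):
    """Hash-partition triples on their subject, then replicate up to n hops."""
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)

    # Step 1: each triple goes to the partition of its subject (1-hop guarantee).
    partitions = [set() for _ in range(num_partitions)]
    for t in triples:
        partitions[subject_partition(t[0], num_partitions)].add(t)

    # Step 2: n - 1 expansion rounds, each adding the triples whose subject
    # is an object reached in the previous round.
    for part in partitions:
        frontier = {o for _, _, o in part}
        for _ in range(n - 1):
            added = set()
            for node in frontier:
                for t in by_subject.get(node, ()):
                    if t not in part:
                        part.add(t)
                        added.add(t)
            frontier = {o for _, _, o in added}
    return partitions


# A chain a -> b -> c -> d; all names are made up.
CHAIN = [("a", "next", "b"), ("b", "next", "c"), ("c", "next", "d")]
parts = hash_then_n_hop(CHAIN, 3, 2)
replication = sum(len(p) for p in parts) / len(CHAIN)
```

With n = 2, the partition owning subject a is guaranteed to hold the first two
edges of the chain, but not necessarily the third one. A chain query of length
three starting at a may therefore need inter-partition communication, which is
the risk mentioned above.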





























8 Conclusions and perspectives 


This paper presents an evaluation of distributed systems ranging over two
important partitioning categories: hashing and graph partitioning. The choice of
using the Spark framework is motivated by its high performance. For certain
operations, it is considered to be 100 times faster than Hadoop MapReduce. While
several systems have been designed on top of Hadoop, we are not aware of any
RDF data management system running on top of Spark. The main motivation
of the experiments is that existing partitioning solutions do not scale gracefully
to several billion triples. For instance, the Metis partitioner is limited to less
than half a billion triples and SemStore (cf. the related work section) relies on
K-means clustering of vectors whose dimension amounts to the number of nodes of
the data to be processed (i.e., 32 million in the case of LUBM1K). Computing
a distance at such a high dimension is currently not possible within Spark, even
when using sparse vectors. Moreover, applying a dimension reduction algorithm
to all the vectors is not tractable.

The conclusion of our experiment is that basic hash-based partitioning
solutions are viable for distributed RDF management: they come at no preparation
cost, i.e., they only require loading the triples onto the right machine, and they
are fully supported by the underlying Spark system. As emphasized by our
experimentation, Spark scales out to several billion triples by simply adding extra
machines. Nevertheless, without any replication, these systems may hinder
availability and reduce the parallelism of query processing. They also involve a
lot of network communication for complex queries which need to retrieve data
from many partitions. Nonetheless, we believe that Spark, by making intensive
use of main memory, provides high potential for these systems. Clearly, with the
measures we have obtained in this evaluation, we can stress that if one needs fast
access to large RDF datasets and is, to some extent, ready to sacrifice the
performance of processing complex query patterns, then the hash-based solution
over Spark is a good compromise.
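
The trade-off can be illustrated with a small Python sketch that measures, for
a subject-hashed data set, the fraction of joining triple pairs stored on the
same partition. The data, the crc32-based hash and the names (partition_of,
local_join_ratio) are our own assumptions for illustration and do not reproduce
the measurements of this paper.

```python
import zlib


def partition_of(subject, num_partitions):
    # crc32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32(subject.encode("utf-8")) % num_partitions


def local_join_ratio(triples, num_partitions, kind):
    """Fraction of joining triple pairs that sit on the same partition.

    kind is "S-S" (shared subject) or "S-O" (object of one triple is the
    subject of the other). Triples are placed by hashing their subject.
    """
    if kind not in ("S-S", "S-O"):
        raise ValueError("kind must be 'S-S' or 'S-O'")
    local = total = 0
    for t1 in triples:
        for t2 in triples:
            if t1 == t2:
                continue
            if kind == "S-S":
                joins = t1[0] == t2[0]
            else:
                joins = t1[2] == t2[0]
            if joins:
                total += 1
                if partition_of(t1[0], num_partitions) == partition_of(t2[0], num_partitions):
                    local += 1
    return local / total if total else 1.0


# Toy data; all names are made up.
TRIPLES = [
    ("student1", "advisor", "prof1"),
    ("student1", "takesCourse", "course1"),
    ("prof1", "worksFor", "dept1"),
    ("prof1", "teacherOf", "course1"),
    ("dept1", "subOrganizationOf", "univ0"),
]
star_ratio = local_join_ratio(TRIPLES, 20, "S-S")
chain_ratio = local_join_ratio(TRIPLES, 20, "S-O")
```

Subject-subject joins are always local under subject hashing, since both
triples share the hashed key. Subject-object joins are local only when the two
subjects happen to fall into the same partition, which is where the network
cost of complex query patterns comes from.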

Concerning the nHopDB and WARP approaches, we consider that using
Metis is an important drawback. Based on these observations, we investigated the
hybrid candidate solution, which does not involve a heavy preparation step and
retains the interesting query workload-aware replication strategy. This approach
may be particularly interesting for data warehouses where the most common
queries (materialized views) are well identified. With this hybrid solution we may
get the best of both worlds: the experiments clearly emphasize that the replication
overhead compared to the pure WARP approach is marginal but the gain in
data preparation is quite important.

Concerning Spark, we highlighted that it can process distributed RDF queries
efficiently. Moreover, the system can be used for the two steps, data preparation
and query processing, in a homogeneous way. Rewriting SPARQL queries into
the Scala language (supported by Spark) is rather easy and we consider that
there is room for optimization. The next versions of Spark, which are supposed
to provide more feedback on data exchange over the network, should help
fine-tune our experiments and design a complete production-ready system.


A Queries

We now present the six queries that have been used during our evaluation. 

Q1: SELECT ?x ?y ?z WHERE {?x lubm:advisor ?y. ?y lubm:worksFor ?z.
?z lubm:subOrganisation ?t.}

Q2: SELECT ?x ?y ?z WHERE {?x rdf:type lubm:GraduateStudent.
?y rdf:type lubm:University. ?z rdf:type lubm:Department.
?x lubm:memberOf ?z. ?x lubm:subOrganizationOf ?y.
?x lubm:undergraduateDegreeFrom ?y}

Q3: SELECT ?x ?y ?z WHERE {?x rdf:type lubm:Student.
?y rdf:type lubm:Faculty. ?z rdf:type lubm:Course. ?x lubm:advisor ?y.
?y lubm:teacherOf ?z. ?x lubm:takesCourse ?z}

Q4: SELECT ?x ?y WHERE {?x rdf:type lubm:Chair.
?y rdf:type lubm:Department. ?x lubm:worksFor ?y.
?y lubm:subOrganizationOf <http://www.University0.edu>}

Q5: SELECT ?x ?y ?z WHERE {?x entity:P131s ?y. ?y entity:P961v ?z.
?z entity:P704s ?w.}

Q6: SELECT ?x ?y ?z WHERE {?x entity:P39v ?y. ?x entity:P580q ?z.
?x rdf:type ?w}
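
As an illustration of how such a query maps onto filter and join operations, the
following sketch evaluates the chain pattern of Q1 over an in-memory list of
triples. It is written in plain Python rather than Scala and is not the code
used in our experiments; the helper names (match, chain_query) and the sample
triples are hypothetical. In Spark, each step would correspond to a filter on a
collection of triples followed by a join keyed on the shared variable.

```python
def match(triples, predicate):
    """Return (subject, object) pairs for a triple pattern with a fixed predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


def chain_query(triples, predicates):
    """Evaluate ?v0 p1 ?v1 . ?v1 p2 ?v2 . ... with successive hash joins.

    Each result row is a tuple (v0, v1, ..., vk) of variable bindings.
    """
    results = match(triples, predicates[0])
    for pred in predicates[1:]:
        # Build a hash index on the subject of the next triple pattern.
        index = {}
        for s, o in match(triples, pred):
            index.setdefault(s, []).append(o)
        # Join on the last bound variable of each partial result.
        results = [row + (o,) for row in results for o in index.get(row[-1], [])]
    return results


# Toy data; all names are made up.
TRIPLES = [
    ("student1", "lubm:advisor", "prof1"),
    ("prof1", "lubm:worksFor", "dept1"),
    ("dept1", "lubm:subOrganisation", "univ0"),
    ("student2", "lubm:advisor", "prof2"),
    ("prof2", "lubm:worksFor", "dept2"),  # dept2 has no matching third triple
]

rows = chain_query(
    TRIPLES, ["lubm:advisor", "lubm:worksFor", "lubm:subOrganisation"]
)
# Q1 only projects ?x ?y ?z, so the binding of ?t is dropped.
q1 = [row[:3] for row in rows]
```

On this toy data only student1 yields an answer, because the chain starting at
student2 stops at dept2. Each join in the loop matches the object of one triple
pattern with the subject of the next, which is the kind of join that may cross
partitions under subject hashing.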


